Zero-Shot Temporal Action Detection via Vision-Language Prompting
Authors
Abstract
Existing temporal action detection (TAD) methods rely on large training data with segment-level annotations and are limited to recognizing previously seen classes during inference. Collecting and annotating a training set for each class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) resolves this obstacle by enabling a pre-trained model to recognize any unseen classes. Meanwhile, ZS-TAD is also much more challenging and has received significantly less investigation. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive method is to integrate an off-the-shelf proposal detector with CLIP-style classification. However, due to its sequential localization (e.g., proposal generation) and classification design, it is prone to localization error propagation. To overcome this problem, in this paper we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). Such a design effectively eliminates the dependence between localization and classification by breaking the route for error propagation in between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard video benchmarks show that our STALE outperforms state-of-the-art alternatives. Besides, our model also yields superior results on supervised TAD over recent strong competitors. The PyTorch implementation is available at https://github.com/sauradip/STALE.
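As a rough illustration of the CLIP-style classification step mentioned in the abstract, the sketch below scores one video segment against class-name text embeddings by cosine similarity and normalizes the scores with a temperature-scaled softmax. It is not the STALE implementation: the function names, the toy three-dimensional embeddings, and the temperature value of 0.07 are all assumptions made for illustration. In practice the embeddings would come from a pre-trained vision-language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(segment_embedding, class_text_embeddings, temperature=0.07):
    """Pick the class whose text embedding is closest to the segment embedding.

    `temperature` is an assumed hyperparameter; smaller values sharpen the softmax.
    """
    names = list(class_text_embeddings)
    logits = [
        cosine(segment_embedding, class_text_embeddings[name]) / temperature
        for name in names
    ]
    # Numerically stable softmax over the class logits.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = {name: e / total for name, e in zip(names, exps)}
    best = max(probs, key=probs.get)
    return best, probs


# Hypothetical toy embeddings standing in for text-encoder outputs of class prompts.
class_text_embeddings = {
    "high jump": [0.9, 0.1, 0.0],
    "playing guitar": [0.0, 0.2, 0.9],
}
# Hypothetical embedding of one localized video segment.
segment_embedding = [0.8, 0.3, 0.1]

label, probs = zero_shot_classify(segment_embedding, class_text_embeddings)
print(label, probs)
```

Because the class set is just a dictionary of text embeddings, a class unseen during training can be added by supplying its embedding, which is the property that makes this style of classifier zero-shot. In the sequential baseline the abstract describes, a poorly localized segment would yield a poor `segment_embedding`, and that error would carry into the classification.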
Similar resources
Zero-Shot Detection
As we move towards large-scale object detection, it is unrealistic to expect annotated training data for all object classes at sufficient scale, and so methods capable of unseen object detection are required. We propose a novel zero-shot method based on training an end-to-end model that fuses semantic attribute prediction with visual features to propose object bounding boxes for seen and unseen...
Zero-shot Cross Language Text Classifica-
Labeled text classification datasets are typically available in only a few select languages. To train a model for, e.g., news categorization in a language Lt without a suitable text classification dataset, there are two options. The first option is to create a new labeled dataset by hand, and the second option is to transfer label information from an existing labeled dataset in a source la...
Zero-Shot Learning via Latent Space Encoding
Zero-Shot Learning (ZSL) is typically achieved by resorting to a class semantic embedding space to transfer the knowledge from the seen classes to unseen ones. Capturing the common semantic characteristics between the visual modality and the class semantic modality (e.g., attributes or word vector) is a key to the success of ZSL. In this paper, we present a novel approach called Latent Space En...
Zero-Shot Learning via Visual Abstraction
One of the main challenges in learning fine-grained visual categories is gathering training images. Recent work in Zero-Shot Learning (ZSL) circumvents this challenge by describing categories via attributes or text. However, not all visual concepts, e.g., two people dancing, are easily amenable to such descriptions. In this paper, we propose a new modality for ZSL using visual abstraction to l...
Zero-Shot Recognition via Structured Prediction
We develop a novel method for zero shot learning (ZSL) based on test-time adaptation of similarity functions learned using training data. Existing methods exclusively employ source-domain side information for recognizing unseen classes during test time. We show that for batch-mode applications, accuracy can be significantly improved by adapting these predictors to the observed test-time target-...
Journal
Journal title: Lecture Notes in Computer Science
Year: 2022
ISSN: 1611-3349, 0302-9743
DOI: https://doi.org/10.1007/978-3-031-20062-5_39